This report was prepared for ETC5521 Assignment 1 by Team Hakea, comprising Dang Thanh Nguyen, Rui Min Lin, (Siddhant V Tirodkar), and (Varsha Ujjinni Vijay Kumar).
People drink coffee for various reasons: to warm their soul on a chilly morning, or to stay awake and focused after a long, restless night. Whatever the reason, it is clear that we love the dark, fragrant liquid. Some people love coffee for its pleasant aroma. Others love its unique flavor notes. Some simply love it because it is as black as their soul. But most would agree that nothing is better than a cup of great coffee to start a wonderful, productive day.
Great coffee beans produce a great cup of coffee. The question is, what should we, as coffee lovers, consider when planning to buy a premium bag of the charming black beans? In this report, we take a journey to the places where the coffee trees were raised and the beans were harvested, to see what affects the quality of our beloved coffee.
The data were originally collected from the Coffee Quality Institute website and scraped by James LeDoux, then re-posted on Kaggle. The data set was subsequently cleaned by Thomas Mock.
The cleaned data set consists of 1339 observations and 43 variables. It has some potential limitations:
- The number of graded coffee beans differs widely from country to country, so some of the analysis will be biased.
- For the US, three areas produce coffee beans: the mainland, Puerto Rico and Hawaii. In this report, we merge these areas together to better represent the country.
- There are outliers in the data, which may require further cleaning before certain analyses can take place.
- The measurement scales are inconsistent. In particular, it is unclear whether the altitude of a farm was converted between metres and feet.
- Some values in the harvest year column are poorly recorded (such as "08/09 crop" or "1t/2011"). Hence, in this report, I used the grading date instead and accept the slight error in timing.
Data wrangling and cleaning are crucial for a smooth exploratory data analysis. The original data is a data frame scraped by James LeDoux in January 2018 from the Coffee Quality Institute website. It contained a few columns consisting of missing values, so the author cleaned the data set by removing variables such as “view_certificate_1”, “view_certificate_2”, etc. There were originally two separate data sets, raw_robusta and raw_arabica. Thomas Mock first cleaned the variable names in both data sets with the function janitor::clean_names, and corrected inappropriate column classes using col_double, col_character, etc. The variables salt_acid, bitter_sweet, fragrance_aroma, mouthfeel, and uniform_cup were renamed to acidity, sweetness, aroma, body and uniformity respectively, to make them easier for readers to understand.
The data sets were then joined with the function bind_rows to produce the merged data set, which was exported to a single csv file, “coffee_ratings.csv”, with 1339 observations and 43 variables.
It is important to know what each of these variables represents with respect to our topic, so the variables included in the data set are described below:
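The cleaning steps described above can be sketched in a few lines. This is a minimal Python illustration of the idea, not the original R code: `clean_name` only approximates what janitor::clean_names does, `bind_rows` here is a hand-written stand-in for the dplyr function, and the raw headers and values are hypothetical.

```python
import re

# Renames listed in the report (old cleaned name -> new name).
RENAMES = {
    "salt_acid": "acidity",
    "bitter_sweet": "sweetness",
    "fragrance_aroma": "aroma",
    "mouthfeel": "body",
    "uniform_cup": "uniformity",
}

def clean_name(name: str) -> str:
    """Convert a raw header to snake_case (rough analogue of clean_names)."""
    name = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())  # punctuation/space runs -> "_"
    return name.strip("_").lower()

def tidy_row(row: dict) -> dict:
    """Clean every column name in a row, then apply the renames."""
    cleaned = {clean_name(key): value for key, value in row.items()}
    return {RENAMES.get(key, key): value for key, value in cleaned.items()}

def bind_rows(*tables):
    """Stack lists of row dicts; columns absent from a row become None."""
    columns = []
    for table in tables:
        for row in table:
            for key in row:
                if key not in columns:
                    columns.append(key)
    return [{col: row.get(col) for col in columns} for table in tables for row in table]

# Hypothetical raw rows, for illustration only.
raw_arabica = [{"Country.of.Origin": "Ethiopia", "Aroma": 8.67}]
raw_robusta = [{"Country.of.Origin": "Uganda", "Fragrance...Aroma": 8.0}]

coffee = bind_rows(
    [tidy_row(row) for row in raw_arabica],
    [tidy_row(row) for row in raw_robusta],
)
print(coffee)
```

Renaming before stacking matters: without it, the two hypothetical tables would contribute separate `aroma` and `fragrance_aroma` columns, each half empty.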
| Variable | Class | Description |
|---|---|---|
| total_cup_points | double | Total rating/points (0 - 100 scale) |
| species | character | Species of coffee bean (arabica or robusta) |
| owner | character | Owner of the farm |
| country_of_origin | character | Where the bean came from |
| farm_name | character | Name of the farm |
| lot_number | character | Lot number of the beans tested |
| mill | character | Mill where the beans were processed |
| ico_number | character | International Coffee Organization number |
| company | character | Company name |
| altitude | character | Altitude (a messy column, left for the analyst to clean) |
| region | character | Region where bean came from |
| producer | character | Producer of the roasted bean |
| number_of_bags | double | Number of bags tested |
| bag_weight | character | Bag weight tested |
| in_country_partner | character | Partner for the country |
| harvest_year | character | When the beans were harvested (year) |
| grading_date | character | When the beans were graded |
| owner_1 | character | Who owns the beans |
| variety | character | Variety of the beans |
| processing_method | character | Method for processing |
| aroma | double | Has both fragrance (ground beans) and aroma (hot water with coffee powder) |
| flavor | double | Flavor grade |
| aftertaste | double | Length of positive flavor remaining after the coffee is swallowed |
| acidity | double | The score depends on the origin characteristics and other factors (degree of roast) |
| body | double | Body grade |
| balance | double | Balance grade |
| uniformity | double | Refers to the consistency of flavor. 2 points are awarded for each cup displaying this attribute, with a maximum of 10 points if all 5 cups are the same. |
| clean_cup | double | Refers to a lack of interfering negative impressions from first ingestion to final aftertaste |
| sweetness | double | Sweetness grade |
| cupper_points | double | The cupper marks the intensity of the Aroma on a scale |
| moisture | double | Moisture Grade |
| category_one_defects | double | Full black or sour bean, pod/cherry, and large or medium sticks or stones (count) |
| quakers | double | Unripened beans that are hard to identify during hand sorting and green bean inspection |
| color | character | Color of bean |
| category_two_defects | double | Parchment, hull/husk, broken/chipped, insect damage, partial black or sour, shell, small sticks or stones, water damage (count) |
| expiration | character | Expiration date of the beans |
| certification_body | character | Who certified it |
| certification_address | character | Certification body address |
| certification_contact | character | Certification contact |
| unit_of_measurement | character | Unit of measurement |
| altitude_low_meters | double | Altitude low meters |
| altitude_high_meters | double | Altitude high meters |
| altitude_mean_meters | double | Altitude mean meters |
The aim of this report is to discover the characteristics of the countries with the best-graded coffee beans. It examines different aspects of coffee beans to explore the factors likely to influence their quality and taste.
Secondary questions:
- Which country produces the best-quality coffee beans, and, within a country, which regions perform better than others in the quality of the beans produced?
- Using a regression model, how do factors like altitude, processing method and defects affect the quality of the beans produced?
- Which countries perform best on individual grading criteria such as aroma, acidity, sweetness, etc.?
- How has the quality of coffee developed over time? (NEW)
- What is the common processing method of top-graded coffee beans, and how do different processing methods behave on the individual grading criteria of coffee? (NEW)
Coffee beans are harvested, produced and exported in many countries around the world. This dataset contains data for Ethiopia, Guatemala, Brazil, Peru, United States, United States (Hawaii), Indonesia, China, Costa Rica, Mexico, Uganda, Honduras, Taiwan, Nicaragua, Tanzania (United Republic of), Kenya, Thailand, Colombia, Panama, Papua New Guinea, El Salvador, Japan, Ecuador, United States (Puerto Rico), Haiti, Burundi, Vietnam, Philippines, Rwanda, Malawi, Laos, Zambia, Myanmar, Mauritius, Côte d'Ivoire and India, as well as records with a missing country (NA). In this report we focus on the production and quality aspects of the beans. The two main varieties of coffee bean are Arabica and Robusta. Approximately 60% of the coffee produced in the world is Arabica and approximately 40% is Robusta. Arabica beans contain about 0.8%-1.4% caffeine and Robusta beans about 1.7%-4% caffeine. Coffee is one of the most important cash crops in the world (Wikipedia, 2020).
The Coffee Quality Institute is a non-profit organization that grades coffee samples from around the world in a consistent and professional manner.
The coffee beans are graded by the Coffee Quality Institute’s trained reviewers. The total rating of a coffee bean is the sum of 10 individual quality measures: aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean cup, sweetness and cupper points. Each measure is graded on a 0–10 scale, resulting in a total cupping score between zero and one hundred.
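The scoring rule above can be written compactly. The symbol $s_i$ is notation introduced here for the grade on the $i$-th quality measure; it does not appear in the data set.

$$
\text{total\_cup\_points} = \sum_{i=1}^{10} s_i, \qquad 0 \le s_i \le 10 \;\Longrightarrow\; 0 \le \text{total\_cup\_points} \le 100
$$

For example, a sample graded 8 on every measure would receive $10 \times 8 = 80$ points.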
Figure 2.1 addresses the primary question: which country produces the best-quality coffee beans? The x-axis shows the country and the y-axis shows the overall rating achieved by the coffee beans. Ethiopia clearly produced the highest-rated coffee beans. However, it is interesting to note that there is not much variation between countries, as most of them have a median score of around 80-85 points. Thus, based on this dataset, we conclude that coffee quality does not differ much between countries, with Ethiopia producing the highest-quality beans.
Figure 2.1: Boxplot for total ratings of coffee beans by country
The dataset contains information about the coffee sample’s altitude of origin, variety, processing method and color. In this section, I examine if any of these characteristics correlate with the total cupping score.
Here, I fitted a linear model of total cupping score on the altitude of the sample. The result is displayed in the coefficient table below and in Figure 2.2. The p-value is lower than 0.05, which indicates that the model is statistically significant, and the result shows a positive relationship between altitude and the quality of coffee beans produced. However, it is worth noting that the adjusted R-squared of the model is extremely small, only 0.022, or about 2% (see Appendix 1), which suggests that altitude explains very little of the variation in score. Therefore, I also computed the Pearson correlation coefficient to examine the relationship between altitude and cupping points. As Table 2.2 shows, the correlation is not statistically significant (r = -0.02, p = 0.50).
Figure 2.2 also reveals some interesting patterns. For example, the majority of coffee samples were grown between 750 m and 2000 m above sea level. Also, most coffee beans score highly, at around 80-85 points.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 81.09426 | 0.22743 | 356.56384 | 0 |
| altitude_mean_meters | 0.00082 | 0.00016 | 5.11726 | 0 |
Figure 2.2: Scatter plot of altitude against total cup points. The blue line shows the fitted regression model; the red lines mark altitudes of 750 m and 2000 m.
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.0201263 | -0.6694689 | 0.5033361 | 1106 | -0.0789258 | 0.0388127 | Pearson’s product-moment correlation | two.sided |
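To give a feel for the size of the altitude effect, the short Python sketch below plugs the reported coefficients into the fitted line and recomputes the Pearson test statistic from the reported r and degrees of freedom. It uses only numbers printed in the tables above; the function and variable names are illustrative, and the original analysis was not necessarily carried out this way.

```python
import math

# Coefficients as reported in the regression table above.
INTERCEPT = 81.09426
SLOPE_PER_METRE = 0.00082

def predicted_points(altitude_m: float) -> float:
    """Fitted total cup points at a given mean altitude in metres."""
    return INTERCEPT + SLOPE_PER_METRE * altitude_m

low = predicted_points(750)    # lower red line in Figure 2.2
high = predicted_points(2000)  # upper red line in Figure 2.2
gain = high - low              # fitted change across the 750-2000 m band

# Pearson test statistic from the reported r and degrees of freedom:
# t = r * sqrt(df) / sqrt(1 - r^2)
r, df = -0.0201263, 1106
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

print(f"fitted score at 750 m:  {low:.3f}")
print(f"fitted score at 2000 m: {high:.3f}")
print(f"fitted gain:            {gain:.3f}")
print(f"Pearson t statistic:    {t_stat:.4f}")
```

Across the whole 750-2000 m band the fitted score rises by only about one point, which is consistent with the very small adjusted R-squared. Note also that for a simple regression fitted to the same observations, R-squared equals the squared correlation, so an R-squared of 0.0232 would correspond to |r| of roughly 0.15. The reported r of -0.02 therefore suggests that the regression and the correlation test may have been run on different subsets of the data (for example, with and without altitude outliers), which the report does not state.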
Processing Method v/s Quality
To check whether the processing method affects the quality of the coffee beans produced, we used an ANOVA test, since processing method is a categorical variable. The test returns a p-value of 0.0956, which is above the 5% significance level, so we cannot reject the null hypothesis that mean quality is the same across processing methods. The residuals plotted against the fitted values are shown in Figure 1. Hence, the data provide no evidence at the 5% level that the processing method used in producing the coffee beans influences the quality of the beans produced.
## Df Sum Sq Mean Sq F value Pr(>F)
## method 4 58 14.468 1.978 0.0956 .
## Residuals 1164 8514 7.314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 169 observations deleted due to missingness
Figure 2.3: ANOVA test for processing methods vs Quality
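As a check on how the ANOVA output above is read, the F value is the ratio of the two mean squares, each of which is a sum of squares divided by its degrees of freedom. The subscripts below are notation introduced here; the numbers are taken from the output.

$$
F = \frac{MS_{\text{method}}}{MS_{\text{residual}}} = \frac{14.468}{7.314} \approx 1.978, \qquad MS_{\text{residual}} = \frac{8514}{1164} \approx 7.314
$$

With 4 and 1164 degrees of freedom, this F value corresponds to the reported p-value of 0.0956, which is greater than 0.05 but less than 0.1, hence the "." significance code.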
Defects v/s Quality
Although the linear model for altitude was statistically significant, it explained very little of the variation in quality, so we tried fitting models with other variables in the dataset. The dataset records category one and category two defects, also known as primary and secondary defects, and we fitted a multiple regression model with both as predictors. Both coefficients are negative with p-values below 0.01 and, as can be seen in Figure 2, almost all the residuals lie close to the zero line, with very few outliers. Thus, a higher defect count is associated with lower quality, although each additional defect lowers the fitted score by only about 0.11 to 0.13 points.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 82.5885115 | 0.1122747 | 735.593003 | 0.00000 |
| category_one_defects | -0.1106967 | 0.0378995 | -2.920796 | 0.00355 |
| category_two_defects | -0.1252918 | 0.0181894 | -6.888192 | 0.00000 |
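The coefficient table above corresponds to the following fitted equation, where $d_1$ and $d_2$ are shorthand introduced here for the category one and category two defect counts:

$$
\widehat{\text{total\_cup\_points}} = 82.5885 - 0.1107\,d_1 - 0.1253\,d_2
$$

As a worked example with hypothetical counts, a sample with $d_1 = 1$ and $d_2 = 5$ has a fitted score of $82.5885 - 0.1107 - 5 \times 0.1253 = 82.5885 - 0.1107 - 0.6265 \approx 81.85$, about 0.74 points below a defect-free sample.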
To check which criteria the top five countries perform best on, we used radar charts, from Figure 3 onwards. A radar chart is a useful way to depict multivariate observations. Each criterion is rated out of 10 points, and all ten criteria are plotted together on the radar chart, along with moisture percentage, to show how a particular country performs on each criterion. The top five coffee-bean-producing countries according to our analysis are Ethiopia, Brazil, United States, Indonesia and Peru.
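| country_of_origin | mean_aroma | mean_flavor | mean_aftertaste | mean_acidity | mean_body | mean_balance | mean_uniformity | mean_clean_cup | mean_sweetness | mean_cupper_points | mean_moisture |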
|---|---|---|---|---|---|---|---|---|---|---|---|
| Brazil | 7.606667 | 7.548182 | 7.363939 | 7.464545 | 7.532727 | 7.543030 | 9.757273 | 9.69697 | 9.939091 | 7.497576 | 0.0803030 |
| Ethiopia | 8.001429 | 8.154286 | 7.892857 | 8.154286 | 7.930000 | 8.012857 | 9.904286 | 10.00000 | 10.000000 | 8.141429 | 0.0885714 |
| Indonesia | 7.682000 | 7.416000 | 7.200000 | 7.214000 | 7.600000 | 7.230000 | 9.866000 | 10.00000 | 9.866000 | 7.268000 | 0.0700000 |
| Peru | 7.446667 | 7.333333 | 7.223333 | 7.386667 | 7.530000 | 7.446667 | 9.776667 | 10.00000 | 10.000000 | 7.306667 | 0.1100000 |
| United States | 7.790000 | 7.875000 | 7.670000 | 7.875000 | 7.790000 | 7.670000 | 9.665000 | 9.66500 | 8.710000 | 7.835000 | 0.0000000 |
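Because the total score is the sum of the ten criteria, the country means in the table above can be added up to give an implied mean total per country. The sketch below does this in Python. It assumes that the table columns are the means of aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean cup, sweetness and cupper points, in that order, and that moisture is not part of the total; the variable names are illustrative.

```python
# Mean criterion scores copied from the table above (moisture excluded).
# Assumed column order: aroma, flavor, aftertaste, acidity, body, balance,
# uniformity, clean cup, sweetness, cupper points.
mean_scores = {
    "Brazil": [7.606667, 7.548182, 7.363939, 7.464545, 7.532727,
               7.543030, 9.757273, 9.69697, 9.939091, 7.497576],
    "Ethiopia": [8.001429, 8.154286, 7.892857, 8.154286, 7.930000,
                 8.012857, 9.904286, 10.0, 10.0, 8.141429],
    "Indonesia": [7.682, 7.416, 7.2, 7.214, 7.6,
                  7.23, 9.866, 10.0, 9.866, 7.268],
    "Peru": [7.446667, 7.333333, 7.223333, 7.386667, 7.53,
             7.446667, 9.776667, 10.0, 10.0, 7.306667],
    "United States": [7.79, 7.875, 7.67, 7.875, 7.79,
                      7.67, 9.665, 9.665, 8.71, 7.835],
}

# Sum of the ten criterion means = implied mean total cup points.
implied_total = {country: sum(scores) for country, scores in mean_scores.items()}

# Countries ordered from highest to lowest implied total.
ranking = sorted(implied_total, key=implied_total.get, reverse=True)

for country in ranking:
    print(f"{country}: {implied_total[country]:.2f}")
```

On these figures Ethiopia stands well clear of the other four countries, whose implied totals all fall within roughly 1.2 points of each other.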
From these plots, we draw the following conclusions. The top five countries share consistently high values for uniformity and clean cup. Among these countries, Ethiopia has the highest (or joint-highest) mean value on every one of the ten grading criteria that make up the total score. It is also interesting that mean sweetness is at or very close to the maximum of 10 for every country except the United States, whose mean is 8.71.
Processing Method amongst best-graded coffee beans
Figure 2.4: Common processing method of top 30 coffee beans by country
| Processing Method | Frequency |
|---|---|
| Natural / Dry | 11 |
| Pulped natural / honey | 1 |
| Semi-washed / Semi-pulped | 1 |
| Washed / Wet | 17 |
Figure 2.4 shows the frequency of each processing method among the top 30 coffee beans. Producers in most countries use the washed/wet method (17 of the 30), followed by the natural/dry method (11 of the 30). In addition, the majority of coffee bean producers in Ethiopia, the country that produces the highest-quality coffee beans, use the natural/dry method. This is explained by Korhonen (2020): “The natural process is common in regions where there is no access to water such as Ethiopia and some regions in Brazil.”
Behaviour of different processing method on the individual grading criteria of coffee
We picked 6 grading criteria, according to the coffee scoring article provided on mycuppa:
Figure 2.5: Scatterplot matrix of grading criteria, differentiated by processing method
The scatterplot matrix in Figure 2.5 examines the relationship between each pair of grading criteria, with points differentiated by processing method.
The scatterplot matrix shows that the grading criteria are positively correlated with one another, to varying extents.
The density plot for acidity, located in the 1st row and 1st column, indicates that the “Pulped natural/honey” method has a relatively low density compared to the other processing methods.
It is worth noting that there are outliers for the “Washed/Wet” method in the aroma grade (in the 3rd column).
A number of coffee beans processed with the “Washed/Wet” method have a relatively low aftertaste grade.
Overall, we observe that the “Washed/Wet” method shows relatively more variation than the other methods. Figure 2.5 also indicates that the primary processing methods across all coffee bean producers are “Washed/Wet” and “Natural/Dry”; the other two methods barely appear.
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0232 0.0223 2.60 26.2 3.66e-7 1 -2620. 5246. 5261.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The dataset was taken from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-07/readme.md
Further details about the data were drawn from https://database.coffeeinstitute.org/coffee/357789/grade
Cerda, R., Allinne, C., Gary, C., Tixier, P., Harvey, C. A., Krolczyk, L., … & Avelino, J. (2017). Effects of shade, altitude and management on multiple ecosystem services in coffee agroecosystems. European Journal of Agronomy, 82, 308-319.
The data is in: Ethiopia has the best coffee. (2020). Retrieved 27 August 2020, from https://towardsdatascience.com/the-data-speak-ethiopia-has-the-best-coffee-91f88ed37e84
En.wikipedia.org. 2020. Coffee Bean. [online] Available at: https://en.wikipedia.org/wiki/Coffee_bean [Accessed 27 August 2020].
Korhonen, Jori. 2020. “Coffee Processing Methods – Drying, Washing or Honey?” Barista Institute. https://www.baristainstitute.com/blog/jori-korhonen/january-2020/coffee-processing-methods-drying-washing-or-honey.